Our code can be found in the following GitHub repo: https://github.com/katallzxc/mie1517
Our final code notebooks (before merging notebooks to create this html file) can be found in the following folder in the repo: https://github.com/katallzxc/mie1517/tree/main/FINAL
This notebook contains select components of our code with annotation and discussion. To see the code only, check out this notebook: https://github.com/katallzxc/mie1517/blob/main/FINAL/Final_Demo.ipynb
For our project, we chose to tackle the problem of fixing group photos where one or more people are not smiling. We’ll discuss the many datasets that we used as well as our general approach; but first, a little motivation.
We all know the frustrations of trying to successfully take a group photo or family photo. These pictures can take dozens of tries to ensure that no one happens to be frowning, sneezing, coughing, or otherwise unsmiling. When there are kids involved, things can get even worse as we try to keep all kids happy and beaming for the duration of the photo session. These frustrations can ruin rare opportunities to take a photo together with friends and family that you don’t see often or spoil expensive and time-consuming holiday photo opportunities.
We decided to try to create a system that solves these issues and mitigates the painful process of taking successful group photos. Specifically, we decided to build a system that can recognize unsmiling faces and transfer smiles onto those faces. This project interested us because it is widely applicable, solves a very relatable problem, and is reasonably challenging.
For instance, in this photo (taken from https://www.twenty20.com/photos/bf9e88e9-a5ab-440e-b633-620e9c7e3d5a), our goal is to transfer smiles onto the faces of the two children, while the parent's face can be left alone since he is already smiling.
In class, we learned how to classify objects or beings in images using CNNs. We also learned how to generate new images with subtle changes using GANs. To tackle the process of fixing group photos, we framed group photos as a collection of photos of individual faces, with the number and location of faces in the photo unknown and the emotion on those faces (smiling or not smiling) also unknown. We theorized that we could use some means of face detection like the Viola-Jones algorithm covered in class to extract each of these individual faces as a subphoto. With each subphoto extracted, we could then determine which faces were smiling and which were not smiling. The smiling faces could be left alone, but the unsmiling faces require modification. For each unsmiling face, we decided to use a GAN method to generate a smile on the face of the unsmiling person. We then would insert each generated smiling face back into the photo at the location of the original unsmiling face.
For our sample photo with the two children, our final workflow followed the diagram shown below. We'll discuss the data used for training and testing as well as the specifics of our model architecture for each network involved in the following sections.
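To make the flow concrete, here is a minimal sketch of the pipeline, with stub stages standing in for the real components built throughout this notebook; the function names, signatures, and tiny 4×4 "photo" are purely illustrative.

```python
import numpy as np

# Stub stages standing in for the real components built later in this
# notebook; names, signatures, and the tiny 4x4 "photo" are illustrative.

def detect_faces(photo):
    """Return {face_num: (face_crop, (x, y, w, h))} for each detected face."""
    # Stub: pretend we found one 2x2 face at the top-left corner.
    return {1: (photo[0:2, 0:2].copy(), (0, 0, 2, 2))}

def is_happy(face_crop):
    """Stub classifier; the real version is a CNN trained on AffectNet."""
    return False

def generate_smile(face_crop):
    """Stub generator; the real version is a convolutional autoencoder."""
    return face_crop

def fix_group_photo(photo):
    fixed = photo.copy()
    for face_num, (crop, (x, y, w, h)) in detect_faces(photo).items():
        if not is_happy(crop):
            # paste the generated smiling face back at its original location
            fixed[y:y+h, x:x+w] = generate_smile(crop)
    return fixed

photo = np.zeros((4, 4, 3), dtype=np.uint8)
fixed = fix_group_photo(photo)
print(fixed.shape)  # (4, 4, 3)
```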
A general note about our model structure, to make the approach to data collection and preprocessing clear: of the stages in our pipeline, the expression classifier is the one we trained on large labelled datasets, so our data preprocessing section will talk mostly about data for the classifier.
The code blocks below show the imports used in this project as well as a couple of helper functions used for importing data from Google Drive.
# import libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
import torchvision.transforms as transforms
from torchvision import datasets
import torchvision.models
from torchvision.io import read_image
from torch.utils.data import Dataset, DataLoader
import re
import tarfile
import matplotlib.image as image
from PIL import Image
!pip install face_recognition
import face_recognition
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)
def unzip_tar_file(source_path, destination_path):
    '''
    Extract the contents of a .tar file to a destination folder in Google Colab.
    '''
    with tarfile.open(source_path, 'r') as tar_ref:
        tar_ref.extractall(destination_path)
def load_images_from_folder(folder,s,cnn = False):
'''
Read data images and their expression label (integer) in from folder.
'''
# get list of image filenames
images = []
labels = []
imagefolder = folder+'/images'
k = os.listdir(imagefolder)[:s]
# read in all images and labels from folder
for filename in k:
# add image to list
img = plt.imread(os.path.join(folder+'/images',filename))
if img is not None:
images.append(img)
# get filename for label for this image ('exp' files give expression label as int)
number = int(re.search(r'\d+', filename)[0])
label_filename = folder+'/annotations/'+str(number)+'_exp.npy'
# get expression label and add to list, or label as "not face" and warn if label not found
if os.path.isfile(label_filename):
label = np.load(label_filename)
else:
label = 10
print("Label for image %s not found"%filename)
labels.append(int(label))
images = np.array(images)
print("Shape of images is ",np.shape(images))
labels = np.array(labels)
print("Shape of labels is ",np.shape(labels))
    if cnn:
        images = np.transpose(torch.tensor(images),[0,3,2,1])
        # NOTE: assumes a pretrained feature extractor named `resnet` is
        # defined in scope before calling with cnn=True
        trainfeature = resnet(images/255)
        trainfeature = [x.clone().detach() for x in trainfeature]
        data = list(zip(trainfeature, labels))
else:
images = np.transpose(torch.tensor(images),[0,3,2,1])
images = images/255
images = [x.clone().detach() for x in images]
data = zip(images, labels)
data = list(data)
return data,images
While we ended up using a premade package for face recognition, we did initially try to implement the Viola-Jones algorithm and hence needed to collect images with faces as well as images that were guaranteed to contain no faces. To do so, we combined images from about 30 categories of the Caltech-256 dataset and manually reviewed the data to reject images with real or cartoon faces. We annotated this set with the label “no_face” and annotated images from the AffectNet dataset with the label “face” to create training and testing splits for the face detection algorithm. Images from the Images of Groups dataset were also collected for testing this algorithm.
The detection data .tar files are linked in this folder: https://drive.google.com/drive/folders/10Ioe5haOScC4bsXRcTDEmVykgpm6nyrn?usp=sharing
You will need to add the .tar files to your Google Drive and change the Drive URL accordingly in the following section to import this data.
group_path = '/content/gdrive/MyDrive/Colab Notebooks/Project/data/detection_data.tar'
dest_path = '/content'
unzip_tar_file(group_path, dest_path)
group_path = '/content/gdrive/MyDrive/Colab Notebooks/Project/data/detection_data_test.tar'
unzip_tar_file(group_path, dest_path)
Some sample "face" and "not face" data elements are displayed below using helper functions to display images in a grid.
# define function to load data from Drive using ImageFolder
def label_flip(label):
return int(not label)
def get_data(data_dir):
transform = transforms.Compose([
transforms.Resize(100),
transforms.CenterCrop(100),
transforms.ToTensor()
])
dataset = datasets.ImageFolder(data_dir, transform=transform, target_transform=label_flip)
return dataset
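One subtlety worth noting: ImageFolder assigns class indices to subfolders alphabetically, so with class folders named (we assume) `face` and `no_face`, `face` would receive index 0. The label_flip target transform inverts the index so that face samples carry the positive label 1:

```python
# torchvision's ImageFolder assigns class indices alphabetically, so with
# subfolders assumed to be named "face" and "no_face", "face" gets index 0.
# label_flip inverts the index so that "face" samples carry label 1.
def label_flip(label):
    return int(not label)

print(label_flip(0), label_flip(1))  # 1 0
```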
# define function to display data
def display_data(dataset,num_imgs,start_ind=0):
# get neat number of images to plot
divisor1 = 1
for i in range(1,int(num_imgs/2)+1):
if num_imgs % i == 0:
divisor1 = i
divisor2 = int(num_imgs/divisor1)
# plot images in grid
for k in range(0,num_imgs):
plt.subplot(divisor2, divisor1, k+1)
plt.axis('off')
cur_img = dataset[k+start_ind][0].numpy()
cur_img = np.clip(cur_img,a_min = 0, a_max = 1)
cur_img = np.transpose(cur_img, (1, 2, 0))
plt.imshow(cur_img)
plt.title("Label="+str(dataset[k+start_ind][1]))
full_set = get_data('detection_data')
print("Length of facial recognition dataset is: ",len(full_set))
# visualize a sample from each set to check
print("Sample face data:")
display_data(full_set,15)
print("Sample non-face data:")
display_data(full_set,15,int(len(full_set)/2+1))
To train our classifier model, we first need to import the training data.
Our main source of categorized pictures of faces was AffectNet, a massive dataset of labelled RGB facial expression images. The labels for this dataset were integers that represented eight primary emotions including “happy”, “sad”, “neutral”, and other expression categories. We used data from these categories for training and validation data.
Facial detection and expression classification are challenging tasks that require large amounts of data, so we decided to collect data from multiple sources to augment our main AffectNet dataset. We collected some other datasets with greyscale images with labelled expressions, namely, the Facial Expression Recognition (FER), Extended Cohn-Kanade (CKPLUS), and Karolinska Directed Emotional Faces (KDEF) datasets. We decided to use this additional data to test our classifier after training with AffectNet to ensure that the test data was totally new to the model. We also used Google search results for pictures of faces in some cases, particularly when looking for group photos.
The training and validation data .tar files are linked in this folder: https://drive.google.com/drive/folders/10Ioe5haOScC4bsXRcTDEmVykgpm6nyrn?usp=sharing
You will need to add the .tar files to your Google Drive and change the Drive URL accordingly in the following section to import this data.
train_path = '/content/gdrive/MyDrive/Colab Notebooks/Project/data/train_set.tar'
valid_path = '/content/gdrive/MyDrive/Colab Notebooks/Project/data/val_set.tar'
unzip_tar_file(train_path, '/content')
unzip_tar_file(valid_path, '/content')
traindata = load_images_from_folder('train_set',4000,cnn = False)
print("Shape of training data: ")
print(np.shape(traindata))
valdata = load_images_from_folder('val_set',400,cnn = False)
print("Shape of validation data: ")
print(np.shape(valdata))
As shown in the plot below, the AffectNet data originally has many emotional categories, but we just need to sort these into "happy" and "non-happy", i.e. binarize the data. The data labels pre- and post-binarization are shown below.
labels_encoder = {0: "Neutral", 1: "Happy", 2: "Sad", 3:"Surprise",
4: "Fear", 5: "Disgust", 6: "Anger", 7: "Contempt"}
labels_encoder_binary = {0: "Non-Happy", 1: "Happy"}
# train set data visualization
labels_dir = 'train_set/annotations/'
labels_data = []
for filename in tqdm(os.listdir(labels_dir)):
if filename.endswith('_exp.npy'):
labels_data.append([filename.replace('_exp.npy','.jpg'), np.load(labels_dir+filename).item()])
labels_df = pd.DataFrame(labels_data)
labels_df[1] = labels_df[1].astype('int32')
print(labels_df.info())
# plot label distribution across expression categories
sns.countplot(x=labels_df[1].map(labels_encoder))
After binarization, we removed excess samples from the majority class to create a balanced dataset and avoid biasing the model toward the most common label. We also converted the RGB images from AffectNet to three-channel grayscale, since colour is not informative for this problem, and replicated the single-channel grayscale test samples across three channels for consistency.
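The binarization and undersampling steps can be sketched as follows; the labels here are random stand-ins for the AffectNet expression labels (0–7, with 1 = "Happy"), not the real annotation files.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for AffectNet expression labels (0-7), where 1 = "Happy".
labels = rng.integers(0, 8, size=1000)

# Binarize: "Happy" (1) stays 1, every other expression becomes 0.
binary = (labels == 1).astype(int)

# Balance by undersampling the majority class.
happy_idx = np.flatnonzero(binary == 1)
nonhappy_idx = np.flatnonzero(binary == 0)
n = min(len(happy_idx), len(nonhappy_idx))
keep = np.concatenate([happy_idx[:n], rng.permutation(nonhappy_idx)[:n]])
balanced = binary[keep]

print(np.bincount(balanced))  # equal counts of each class
```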
Before training our classifier, we used our face detection algorithm to extract faces from all sample images. We considered this a stage of pre-processing for the classifier stage of the model since the face detection algorithm eliminated any invalid faces in the sample data. After preprocessing, the inputs to our classifier were 3D-grayscale images of recognizable faces. After the first face extraction stage of our model was applied, we obtained 52,500 images for training, 980 for validation and 1820 for testing, which was enough data for our problem.
A similar process was used for the validation data from AffectNet as well as the test data from other datasets mentioned below. The full code for data preprocessing can be seen in this notebook: https://github.com/katallzxc/mie1517/blob/main/FINAL/Binary_Classifier_Preprocessing.ipynb
Although we had partial code that created Viola-Jones features, our Viola-Jones implementation did not function correctly in time for the completion of this project, so we will leave this code out for brevity (though it is given in https://github.com/katallzxc/mie1517/blob/main/FINAL/FaceExtraction.ipynb). Instead, we used a Viola-Jones implementation provided by OpenCV.
The call to this package as well as the helper function that we wrote for face extraction are shown below.
#link to face weight for viola jones algorithm using opencv
import cv2
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
# extract human faces in an image as ROI using trained viola jones weights
def Face_Extraction(image_path):
# read the input image
image_input = np.array(Image.open(image_path))
image = cv2.imread(image_path,0)
# find face locations
face_locations = face_cascade.detectMultiScale(image, 1.3, 5)
# save extracted face image in dictionary
# {face_num: face_image_array}
face_num = 1
face_dict = {}
for (x, y, w, h) in face_locations:
face_dict[face_num] = (image_input[y:y+h, x:x+w], (x, y, w, h))
face_num += 1
return face_dict
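Face_Extraction stores each detected face as a numpy crop plus its bounding box, and the crop itself is plain array slicing (note the row/column order: y before x). A synthetic example of the returned dictionary structure, with a made-up bounding box standing in for a detectMultiScale result:

```python
import numpy as np

# A blank 10x10 RGB "image" and a hypothetical bounding box standing in
# for one returned by detectMultiScale.
image = np.zeros((10, 10, 3), dtype=np.uint8)
x, y, w, h = 2, 3, 4, 5

# Same structure as Face_Extraction's return value:
# {face_num: (face_crop, (x, y, w, h))}
face_dict = {1: (image[y:y+h, x:x+w], (x, y, w, h))}

crop, box = face_dict[1]
print(crop.shape)  # (5, 4, 3): h rows, w columns, 3 channels
```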
For emotion classification, we built a CNN model that takes AlexNet features for each training face image as input and passes them to a three-layer fully connected classifier. We first tried to classify multi-class facial expressions to see whether we could generate multiple types of facial expression; however, our best validation accuracy in this case was a paltry 36%. Some expressions, such as contempt and fear, were difficult to distinguish from happy faces, and by examining the confusion matrix we found the model could distinguish happy from non-happy faces fairly well. Since most group photo use cases only need smiling faces and not any other expression anyway, we decided to use a binary classifier, trained with the BCEWithLogitsLoss function and the Adam optimizer.
Our CNN training code is given in this notebook: https://github.com/katallzxc/mie1517/blob/main/FINAL/Binary_Classifier_Training.ipynb
The architecture for our model is shown below.
# define CNN
class HappyFaceClassifier_alex(nn.Module):
def __init__(self):
super(HappyFaceClassifier_alex, self).__init__()
self.name = "HFC_alex"
self.fc1 = nn.Linear(256*6*6, 1024)
self.fc2 = nn.Linear(1024, 512)
self.fc3 = nn.Linear(512, 1)
def forward(self, x):
x = x.view(-1, 256*6*6)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = self.fc3(x)
x = x.squeeze(1) # Flatten to [batch_size]
return x
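A rough sketch of a single training step with BCEWithLogitsLoss and Adam is shown below, using an equivalent nn.Sequential stand-in for the classifier head above so the cell runs on its own; the batch of features and labels is a random placeholder, and the learning rate is illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Equivalent nn.Sequential stand-in for the classifier head above, so this
# cell is self-contained; the lr below is illustrative.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 1024), nn.ReLU(),
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 1),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(head.parameters(), lr=0.005)

# Random stand-ins for a batch of AlexNet features and binary labels.
features = torch.randn(4, 256, 6, 6)
targets = torch.randint(0, 2, (4,)).float()

# One training step.
optimizer.zero_grad()
logits = head(features).squeeze(1)   # shape [4], raw logits
loss = criterion(logits, targets)    # sigmoid is applied inside the loss
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([4])
```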
During training, the model hyperparameters were tuned and the best performance was achieved using:
The weights from this training process are imported below.
#download weight for CNN
!gdown --id '1wn9D-aJDGtboO8eZFZHlxW4KS5njz_3j'
Our model can identify happy and unhappy faces using this trained classifier together with the helper function Face_Classifier. The Happy_Face_Recognition function combines the face detection, extraction, and classification steps to find all faces in a group photo and show which are happy and which are unhappy.
def Face_Classifier(image_array):
# define image
image = torch.tensor(np.transpose(image_array, (2, 1, 0)))
# resize image to fit model (3x224x224)
transform = transforms.Compose([transforms.Resize(224),
transforms.Grayscale(3)])
image = transform(image).unsqueeze(dim=0)
# define trained models
model = HappyFaceClassifier_alex()
pretrained_model = torchvision.models.alexnet(pretrained=True)
model_path = '/content/model_HFC_alex_bs512_lr0.005_epoch8'
state = torch.load(model_path)
model.load_state_dict(state)
# feature extraction
features = pretrained_model.features(image/255)
# model prediction
output = model(features)
pred = (output > 0.0).squeeze().long()
if pred == 1:
print("This person is happy.")
return True
else:
print("This person is not happy.")
return False
def Happy_Face_Recognition(image_path):
# display image
image_input = Image.open(image_path)
plt.imshow(np.asarray(image_input))
plt.show()
# extract faces in the image
face_dict = Face_Extraction(image_path)
nonhappy_face_dict = {}
# loop over extracted faces
for face_num in face_dict:
face_image_array = face_dict[face_num][0]
# show extracted faces
plt.imshow(face_image_array)
plt.show()
# recognize happy facial expression
if Face_Classifier(face_image_array) == False:
nonhappy_face_dict[face_num] = (face_image_array, face_dict[face_num][1])
return nonhappy_face_dict
Our first attempt used a GAN to generate smiles, but even after many rounds of architecture tuning, the result was not recognizable as a face, let alone a smiling face. Changing part of the facial structure while maintaining the person's overall facial identity proved impossible for us with a simple GAN. Instead, we needed to learn a good embedding, as in lecture, but the subject here was human facial anatomy rather than simple decorations. Training such an embedding was beyond our computational resources, so instead of training a model from scratch, we decided to take advantage of the generator from a project called GANimation.
Ideally, to get a convincing output smile, we would have a dataset containing many different expressions from the same person, allowing the facial expression to change while the facial structure stays the same. Such a dataset is very hard to find: many datasets are not guaranteed to have multiple pictures of the same person, and even when they do, that person's face is likely to be oriented differently across pictures.
Instead, we decided to use GANimation to essentially create such a database for ourselves. GANimation builds on the OpenFace project, which extracts anatomical features from human faces as facial expression metrics called action units (AUs), enabling expression transfer between different human faces; these AUs can be treated as a very good embedding. We used GANimation to exchange the AU features between a target unsmiling picture and a smiling picture of another person, giving us a database of the same person with different facial expressions. This manufactured dataset could then be used to train our autoencoder.
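Conceptually, each unsmiling face and its GANimation-smiled counterpart form an (input, target) pair for supervised autoencoder training. A minimal sketch with random tensors standing in for the aligned image pairs (the real pairs would be loaded from disk):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random tensors standing in for N aligned (unsmiling, GANimation-smiled)
# image pairs of the same person; real pairs would be loaded from disk.
N = 8
unsmiling = torch.rand(N, 3, 224, 224)   # autoencoder inputs
smiling = torch.rand(N, 3, 224, 224)     # training targets

pairs = TensorDataset(unsmiling, smiling)
loader = DataLoader(pairs, batch_size=4, shuffle=True)

x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([4, 3, 224, 224]) twice
```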
Our autoencoder architecture is shown below. It has six convolutional layers and uses an MSE loss function in training. Full training and set-up of this autoencoder are shown in this notebook: https://github.com/katallzxc/mie1517/blob/main/FINAL/Autoencoder_Training.ipynb
#define autoencoder
class Autoencoder(nn.Module):
def __init__(self):
super(Autoencoder, self).__init__()
self.encoder = nn.Sequential( # like the Composition layer you built
nn.Conv2d(3, 8, 3, stride=2, padding=1),
nn.ReLU(),
nn.Conv2d(8, 16, 3, stride=2, padding=1),
nn.ReLU(),
nn.Conv2d(16, 32, 3, stride=2, padding=1),
nn.ReLU(),
nn.Conv2d(32, 64, 3, stride=2,padding=1),
nn.ReLU(),
nn.Conv2d(64, 128, 3, stride=2,padding=1),
nn.ReLU(),
nn.Conv2d(128, 256, 7)
)
self.decoder = nn.Sequential(
nn.ConvTranspose2d(256, 128, 7),
nn.ReLU(),
nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
nn.ReLU(),
nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
nn.ReLU(),
nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
nn.ReLU(),
nn.ConvTranspose2d(16, 8, 3, stride=2, padding=1, output_padding=1),
nn.ReLU(),
nn.ConvTranspose2d(8, 3, 3, stride=2, padding=1, output_padding=1),
nn.Sigmoid()
)
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
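The encoder halves the spatial resolution five times (224 → 112 → 56 → 28 → 14 → 7) before the final 7×7 convolution collapses each feature map to 1×1, which is why faces are resized to 224×224 before being passed to the model. A quick check of the stride-2 convolution arithmetic:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Standard conv output-size formula: floor((size + 2p - k)/s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
for _ in range(5):                # five stride-2, k=3, p=1 convolutions
    size = conv_out(size, 3, stride=2, padding=1)
print(size)                       # 7
print(conv_out(size, 7))          # final 7x7 conv collapses 7x7 -> 1x1
```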
During training, the model hyperparameters were tuned and the best performance was achieved using:
The weights from this training process are imported below.
#download weight for Autoencoder
!gdown --id '1DNi6kKXQ9xfa_z4AmO1FnSTYYM2AEwKJ'
The Transfer_Face helper function below loads the autoencoder model and generates a smiling face from an inputted unsmiling face. The Replace_Face function then puts the smiling face in the original picture to replace the original unsmiling face.
def Transfer_Face(nonhappy_face_dict):
    # load autoencoder model
model = Autoencoder()
model.load_state_dict(torch.load('/content/model_weights.pth'))
fake_face_dict = {}
for face_num in nonhappy_face_dict:
#resize image to 224*224
face_image_array = nonhappy_face_dict[face_num][0]
pic = Image.fromarray(face_image_array)
        pic = pic.resize((224, 224), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
pic.save('my1.jpg')
face_image_array2= plt.imread('my1.jpg')
        # transfer faces: image array -> channel-first float tensor, run
        # autoencoder, then back to a channel-last uint8 image array
        input_tensor = torch.tensor(np.transpose(face_image_array2, [2, 1, 0]) / 255)[None, ...].float()
        output_tensor = model(input_tensor).cpu().detach()
        fake_image_array = np.transpose(np.asarray(output_tensor), [0, 3, 2, 1])[0] * 255
        fake_image_array = fake_image_array.astype(np.uint8)
pic = Image.fromarray(fake_image_array)
        pic = pic.resize((face_image_array.shape[1], face_image_array.shape[0]), Image.LANCZOS)
pic.save('my.jpg')
fake_image_array= plt.imread('my.jpg')
fake_face_dict[face_num] = (fake_image_array, nonhappy_face_dict[face_num][1])
return fake_face_dict
def Replace_Face(fake_face_dict, image_path):
# read the input image
image = np.array(Image.open(image_path))
for face_num in fake_face_dict:
face_image_array = fake_face_dict[face_num][0]
# replace face images
(x, y, w, h) = fake_face_dict[face_num][1]
image[y:y+h, x:x+w] = face_image_array
plt.imshow(image)
plt.show()
return Image.fromarray(image)
Our evaluation of the autoencoder is mostly qualitative, but the classifier can be analyzed in terms of its overall accuracy. The accuracy training curve is shown below. Our final classifier model is 90% accurate on training data, 84% accurate on validation data, and 70% accurate on new test data.
Qualitatively, the classifier does fairly well. You can see some successful test cases shown below at the top and some failed cases shown at the bottom. The child at the bottom left is clearly NOT happy. However, the other case, on the bottom right, shows a woman whose expression might be confused for an unhappy contemptuous expression since her smile does not reach her eyes.
Our autoencoder inputs are images of an unsmiling person and the outputs are images of the same person with an automatically generated smile. In most cases, the output has an expression that is visibly more like a smile than the expression in the input image. The outputs are highly blurry, but the smile looks natural in many cases if the replaced face image is part of a low-resolution image or is small relative to the overall image.
The new data for this example is the image file test_1.jpg, which should be uploaded to Colab from the data Google Drive folder/this GitHub link: https://github.com/katallzxc/mie1517/blob/main/FINAL/test_1.jpg
The Happy_Face_Recognition function is used to extract and classify all faces, then the autoencoder generates a smiling picture for each nonhappy face in the Transfer_Face function, then the fake faces are put back into the photo in the Replace_Face function.
# Input: image directory
image_path = '/content/test_1.jpg'
# Apply happy face recognition
nonhappy_face_dict = Happy_Face_Recognition(image_path)
fake_face_dict = Transfer_Face(nonhappy_face_dict)
Replace_Face(fake_face_dict, image_path)
The fake generated faces are very blurry, but we can see a detectable smile on each face. We'll also try the code on the picture with two kids that we showed as an example at the start of this guide.
# Input: image directory
image_path = '/content/test_2.jpg'
# Apply happy face recognition
nonhappy_face_dict = Happy_Face_Recognition(image_path)
fake_face_dict = Transfer_Face(nonhappy_face_dict)
Replace_Face(fake_face_dict, image_path)
Comparing our work to the GANimation results for smile transfer, we can see that our transferred smiles are much less crisp than the GANimation smiles. Our project would be better suited for use in low-resolution group photos where each face is already quite small and the blurriness is not as evident, whereas the GANimation results would work well for close-up shots.
We think our model performs generally well. It would work best for low-resolution images or large group photos where each individual face is very small. The generated images are very blurry, but this was not surprising given that blurriness is a known effect of the MSE loss function used in autoencoders. Interestingly, our model fails to create a smile in cases where the person's mouth is obscured. Overall, we are satisfied with the results of our model, since it does create visible smiles on pictures of unsmiling people.
%%shell
jupyter nbconvert --to html /MIE1517Team2.ipynb